IA @ iPRES Schedule

1. 09:00 - Breakfast & Networking
2. 09:30 - Welcome & Introductions
3. 09:45 - Updates from Internet Archive
4. 10:15 - Partner Talks and Demos
5. 11:00 - Break & Socializing
6. 11:30 - Partner & Internet Archive Talks and Demos
7. 12:30 - Wrap-up & Special Announcements
8. 13:00 - Lunch Provided by the Internet Archive
9. 14:00 - Enjoy Amsterdam


Archive-It | iPRES 2019 | Internet Archive


Meeting Format / Goals 


1. Informal and Social
2. Conversational
3. What IA is up to
4. What partners & friends are up to
5. Ideas & feedback on IA services, features

IA Staff + Quick Talks

• Jefferson Bailey, Director, Web Archiving & Data Services
  o Talk: high-level overview of services / activities

• Kyrie Whitsett, Program Officer, WA & DS
  o Talk: high-level overview of Archive-It

• Helge Holzmann, Data Engineer, WA & DS
  o Talk: a few specific web/data projects

IA Staff + Quick Talks

• Jefferson Bailey, Director, Web Archiving & Data Services
  o Landscape Thoughts
  o Web & Data Services updates
  o Strategic activities
  o Cloud Services

Landscape Thoughts

Wide:

• Sustainability Questions

• Economic Forecasts

• Cloud, Hosted, E2E

Focused:

• “Secondary” Outputs

• Degrees of Distribution

• Capture/Replay/Affiliated

Archive-It & Web/Data Group

Web Archiving, Data Services, Web Dev, Indexing/Access

• Subscription & […] Services
• 700 Institutional […]
  o 60% .edu (85+% ARLs; 94% in U.S.)
  o […]tional LAMs/Govs/NGO
• […] data / year
• ~20+ global team engineering/program

Strategic Activities

• Community Building
  o Public/Art/Historical Libraries; non-profits

• Service Integrations
  o Service adoptions, joint services

• Product Diversification
  o Portfolio-ized and complementary

• Core Technologies
  o More investment in core stack

Strategic Activities

• Improve capture and replay
• Diversify Services and Users
• Automate Client Management
• Cross-Service Optimization
• Growth / Rem[…]

Scholarly Archiving One-Liner

Build a complete, use-oriented, highly-available archive and knowledge graph of every publicly-accessible scholarly output + identifier metadata and full-text, linked with versions, secondary outputs (data/blogs/etc) with a priority on long-tail, at-risk publications - all with API-first machine-readable distributed and bulk access.

Goals/Concepts of this Work

• Automation & web harvesting as technical approaches for scaling archiving scholarship
• Leverage existing corpus via ID and ETL/augment
• Build API-first, “biblio” access, preservation, and research/knowledge services on the collection
• Address critical need and collaborative opportunity for perpetual access to open knowledge

Methodologies

• Top-down:
  o use lists/IDs/etc to target harvesting and […] scholarship with MD & secondaries
• Middle-sideways:
  o integrate w/ (OA) pub systems/platforms
• Bottom-up:
  o ML/algorithms to identify extant works, associate […] quality of preservation, identify secondaries

[Architecture diagram; component labels: Scrawler; FatCat; Cluster (Hadoop); Search (Elastic); Processing (Grobid); db (Postgres); IA Web Archive; IA General Archive; Metadata Store; IA Petabox Repository; Data Center Mirrors (two boxes); Users]

Framework for Distributed Research Services

[Diagram labels: drives; Custodian hardware/cloud; Comm Cloud (AWS, Goog, Azure, Wolfram, etc); Academic HPC; Local + tooling/analytics; Seeds, WARCs, Derivative Datasets, Publications, Research Data]

Non-Profit Community Cloud in Practice

[Workflow diagram; box labels as extracted: Identifies or Creates Collection; Dataset parsing and augmenting; Selects or Crawls; Uses WASAPI to identify WARCs; WASAPI to transfer datasets; WASAPI to submit dataset job; Data Vis & Publication; Fame & profit (or tenure)]
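
The "WASAPI to identify WARCs" and "WASAPI to transfer datasets" steps in the workflow above could look roughly like the following offline Python sketch. The endpoint URL, the `collection` query parameter, and the `files` / `locations` / `checksums` response fields are assumptions based on the published WASAPI data transfer specification and should be checked against the service in use; the collection ID and the sample response are invented for illustration, and no network call is made.

```python
from urllib.parse import urlencode

# Assumed endpoint of an Archive-It style WASAPI deployment (not given on the slides).
WASAPI_WEBDATA = "https://partner.archive-it.org/wasapi/v1/webdata"


def build_webdata_url(collection_id, filters=None, base=WASAPI_WEBDATA):
    """Build a webdata query URL that lists the files of one collection."""
    params = {"collection": collection_id}
    params.update(filters or {})
    # Sort parameters so the same query always yields the same URL.
    return base + "?" + urlencode(sorted(params.items()))


def warc_downloads(response):
    """Yield (filename, first location, sha1) for each file in a parsed response."""
    for entry in response.get("files", []):
        locations = entry.get("locations", [])
        if not locations:
            continue  # nothing to transfer for this entry
        sha1 = entry.get("checksums", {}).get("sha1")
        yield entry["filename"], locations[0], sha1


# Hypothetical response body, shaped like one page of webdata results.
sample_response = {
    "count": 1,
    "next": None,
    "files": [
        {
            "filename": "EXAMPLE-20190916-00000.warc.gz",
            "locations": ["https://example.org/webdatafile/EXAMPLE-20190916-00000.warc.gz"],
            "checksums": {"sha1": "0" * 40},
            "size": 1024,
        }
    ],
}

query_url = build_webdata_url(12345)
downloads = list(warc_downloads(sample_response))
print(query_url)
print(downloads)
```

A real client would fetch `query_url` with authentication, follow the `next` link until it is empty, and compare each downloaded file against the listed checksum before parsing.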




Custom Web Services 








[Screenshot: German Web Archive Search (site navigation: Online Publications, Standardisation, German Music, German Web Archive Search). Web Archive Search for "shamrock"; result URLs:]

http://www.hundebabies.de/neues/puppy.au
http://irish-net.de/files/irish-dance-lernen-gabriell.jpg
http://irish-dance-school-bremen.de/attachments/Logo/Irish-Dance-Mobil.JPG?template=ger
http://bayreuth.bayern-online.de/uploads/pics/dance_irish_dance.jpg
http://heinzinge.de/heinz%20irish%20dance.mp3
http://divine-dance.de/Erica-Irish-1.gif
https://irish-days.de/files/2016/06/Ceili-Irish-Dance-1100x400.jpg

[Screenshot: JSON record, partially visible:]

  "abstracts": [],
  "refs": [ ... ],
  "contribs": [
    ...
      "role": "author"
    }
  ],
  "language": "en",
  "publisher": "Public Library of Science",
  "pages": "e124",
  "volume": "2",
  "doi": "10.1371/journal.pmed.0020124",
  "release_date": "2005-08-30T00:00:00Z",
  "release_status": "published",
  "release_type": "journal-article",
  "container_id": "aaaaaaaaaaaaaeiraaaaaaaaam",
  "files": [
    {
      "mimetype": "application/pdf",
      "urls": [
        {
          "url": "http://journals.plos.org/plosmedicine/article/file?id=10.1371/j…
          "rel": "publisher"
        }
      ],
      "sha256": "ffc1005680cb620eec4c913437dfabbf311b535cfe16cbaeb2faec1f92afc362",
      "md5": "f4de91152c7ab9fdc2a128f962faebff",
      "sha1": "3f242a192acc258bdfdb151943419437f440c313",
      "size": 255629,
      "revision": "00000000-0000-0000-3333-fff000000003",
      "ident": "aaaaaaaaaaaaamztaaaaaaaaam",

http://www.usetronik.de:80/sounds/mp3/irish/Irish%20Rovers%20-%20Lord%20of%20the%20Dance.mp3
http://www.extralime.ie/mi…
http://berlin-irish-dance-co.de/img_lib/main.jpg
http://www.keoghs.ie/wp-c…
http://www.classichits.ie/wp-content/uploads/2017/11/irish-shamrock-flag-christopher-rowlands-360x240.jpg
http://139.162.250.120/clarefm/wp-content/uploads/sites/17/royal-british-legion-shamr…badge-80x60.jpg

Non-Profit Community Cloud in Character

• Service models not membership models

• Product precedes consensus

• Better over best / operational over aspirational

• Flexible service and sustainability models

• Community-driven, service-managed

• Digipres on […]

• Service simplicity, local integrations, cost tiers

Non-Profit Community Cloud in Practice

• Location: 3 main data centers, geographically distributed. 5 online data centers in 3 continents. Partial offline/nearline copies in 4 continents.

• Replication: From 6x to 2x across data centers and environments, various determining factors.

• Migration/derivation: Both automated and on-demand.

• Verification: From monthly to annual to operational. Multiple cryptographies. Various determining factors.
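
As an illustration of the verification bullet, here is a minimal fixity-check sketch in Python. It assumes that "multiple cryptographies" can be read as several digest algorithms computed over the same bytes; the function names, the choice of MD5/SHA-1/SHA-256, and the in-memory sample payload are illustrative and not taken from the slides.

```python
import hashlib
import io


def compute_digests(stream, algorithms=("md5", "sha1", "sha256"), chunk_size=1 << 20):
    """Read the stream once and return one hex digest per algorithm."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        for hasher in hashers.values():
            hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def verify(stream, expected):
    """Return the sorted names of algorithms whose digest differs from the record."""
    actual = compute_digests(stream, algorithms=tuple(expected))
    return sorted(name for name in expected if actual[name] != expected[name].lower())


payload = b"hello"
record = compute_digests(io.BytesIO(payload))            # stand-in for a stored manifest entry
mismatches_ok = verify(io.BytesIO(payload), record)      # unchanged copy
mismatches_bad = verify(io.BytesIO(b"hellp"), record)    # copy with one altered byte
print(mismatches_ok, mismatches_bad)
```

Running a check like this on a schedule (monthly, annual, or on access) and against each replica is one way the cadence on the slide could be put into practice.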

Non-Profit Community Cloud in Practice

• Open APIs for transfer/distributed pres (IA-S3, WASAPI)
• Open data mirrors (CDL, other science data services)
• LOCKSS pilots + APT and other repo/pres pilots
• Integration in discovery (Unpaywall, WorldCat, etc)
• Ingest of publisher and PID feeds (DOI, ISSN, etc)
• Preservation/stor[…] open data publishers
• SLA-ish “storage as a service” contracts
• Decentralized pilot projects for redistribution
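
For the IA-S3 transfer API named in the list above, the following Python sketch assembles an upload request without sending it. The endpoint, the `LOW accesskey:secret` authorization scheme, and the `x-amz-auto-make-bucket` and `x-archive-meta-*` headers follow the public IA-S3 documentation as best recalled and should be verified before use; the function name, item name, filename, and credentials are hypothetical.

```python
from urllib.parse import quote


def ias3_put_request(item, filename, access_key, secret_key, metadata=None,
                     endpoint="https://s3.us.archive.org"):
    """Return (method, url, headers) for uploading one file into an item."""
    headers = {
        # IA-S3 uses a simple key:secret scheme rather than AWS request signing.
        "authorization": f"LOW {access_key}:{secret_key}",
        # Ask the service to create the item if it does not exist yet.
        "x-amz-auto-make-bucket": "1",
    }
    # Item-level metadata travels as x-archive-meta-<field> headers.
    for key, value in (metadata or {}).items():
        headers[f"x-archive-meta-{key}"] = str(value)
    url = f"{endpoint}/{quote(item)}/{quote(filename)}"
    return "PUT", url, headers


method, url, headers = ias3_put_request(
    "demo-item", "a b.warc.gz", "KEY", "SECRET", {"mediatype": "web"}
)
print(method, url)
print(headers)
```

The returned triple can be handed to any HTTP client together with the file body; keeping request construction separate from sending makes it easy to inspect or test.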

THANK YOU